feat: Add 100% Spark-compatible regex support via codegen dispatcher by andygrove · Pull Request #4239 · apache/datafusion-comet

andygrove · 2026-05-06T13:56:23Z

Which issue does this PR close?

Part of the simplification discussed in #4310.

Rationale for this change

Add support for all Spark regex expressions (rlike, regexp_extract, regexp_extract_all, regexp_instr, regexp_replace, split) with full java.util.regex compatibility by routing them through the Arrow-direct codegen dispatcher introduced in #4417. The dispatcher Janino-compiles Spark's own doGenCode for the expression, so the regex family inherits Spark-identical semantics with no per-expression glue code.

The native Rust regex engine is potentially faster but cannot fully match Java regex semantics (backreferences, lookaround, embedded flags, etc.). Rather than expose users to two orthogonal axes (engine choice plus a per-expression allowIncompatible flag), this PR collapses to a single engine selector.
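As a concrete illustration of the constructs named above, the following minimal sketch exercises a backreference, a lookahead, and an embedded flag using only `java.util.regex`. The class and method names are hypothetical and are not part of this PR; the sketch shows only the Java side and makes no claim about how the Rust engine treats each pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from the PR.
final class JavaRegexFeatures {
    // Backreference: \1 must repeat exactly what group 1 captured.
    static boolean hasDoubledWord(String s) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(s).find();
    }

    // Lookahead: match digits only when "px" follows, without consuming "px".
    static String firstPixelValue(String s) {
        Matcher m = Pattern.compile("\\d+(?=px)").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Embedded flag: (?i) enables case-insensitive matching from inside the pattern.
    static boolean matchesIgnoringCase(String s) {
        return Pattern.compile("(?i)comet").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledWord("the the cat"));
        System.out.println(firstPixelValue("width: 42px"));
        System.out.println(matchesIgnoringCase("CoMeT"));
    }
}
```

Patterns like these are the reason the default route keeps evaluation on `java.util.regex.Pattern` rather than translating them to another engine.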

Configs

spark.comet.exec.regexp.engine in {rust, java}, default java
- java: route the regex expression through the codegen dispatcher so Spark's own doGenCode (backed by java.util.regex.Pattern) runs inside the Comet pipeline for full Spark semantics. Requires spark.comet.exec.scalaUDF.codegen.enabled (also default true); falls back to Spark with an explanatory message when that flag is disabled.
- rust: run the native DataFusion regex engine when an implementation exists. Setting this is itself the opt-in for the semantic differences from Java regex: no separate allowIncompatible flag needed. Expressions without a native implementation (regexp_extract, regexp_extract_all, regexp_instr) fall through to the JVM codegen dispatcher so users still get Comet acceleration with full Spark semantics.
spark.comet.exec.scalaUDF.codegen.enabled now defaults to true (was false). With pure defaults, the regex family runs on the Comet path with Spark-identical semantics, and the DateFormatClass dispatcher path is similarly active. Setting the flag to false reverts to Spark fallback for paths that depend on the dispatcher.
Per-expression disable: each regex expression has a spark.comet.expression.<ClassName>.enabled flag (default true) that disables Comet's serde for that expression. Useful for narrowing a regression or comparing performance on a single operator without touching the engine selector.

What changes are included in this PR?

Add a RegexpRoute helper in strings.scala that each regex serde delegates to. It picks between the native Rust engine, the codegen dispatcher, and Spark fallback based on engine and scalaUDF.codegen.enabled. Under engine=rust, expressions with no native path fall through to the dispatcher rather than to Spark.
For expressions with no native Rust path (regexp_extract, regexp_extract_all, regexp_instr), introduce a CometRegexpCodegenOnly base class. Each serde is a one-line subclass.
For expressions with a native path (rlike, regexp_replace, split), the JVM arm delegates to CometScalaUDF.emitJvmCodegenDispatch. The native arm is unchanged.
Native serdes surface as Incompatible(notes, optedInBy="...engine=rust") so the standard gating in QueryPlanSerde recognizes engine=rust as the opt-in via optedInBy.
Extend SupportLevel.Incompatible with an optedInBy: Option[String] field, plumbed through scalar- and aggregate-expression gating in QueryPlanSerde.
Add the spark.comet.exec.regexp.engine config in CometConf.
Flip the default of spark.comet.exec.scalaUDF.codegen.enabled to true and drop "experimental" language from the regex/codegen docs and config strings.
Remove RegExp.isSupportedPattern (was a placeholder always returning false).
Document the model in docs/source/user-guide/latest/compatibility/regex.md, including the per-expression disable knobs.
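The routing decision described for RegexpRoute can be sketched as a small pure function. This is an illustration written in Java under stated assumptions, not the Scala helper from strings.scala: the enum names, the method name, and the parameter names are all hypothetical, and the real helper may consult additional state.

```java
// Hypothetical sketch of the routing rules described in the PR text.
final class RegexpRouteSketch {
    enum Engine { RUST, JAVA }
    enum Route { NATIVE_RUST, CODEGEN_DISPATCHER, SPARK_FALLBACK }

    /**
     * @param engine         value of spark.comet.exec.regexp.engine
     * @param codegenEnabled value of spark.comet.exec.scalaUDF.codegen.enabled
     * @param hasNativeImpl  whether the expression has a native Rust implementation
     */
    static Route choose(Engine engine, boolean codegenEnabled, boolean hasNativeImpl) {
        // engine=rust uses the native path only when one exists.
        if (engine == Engine.RUST && hasNativeImpl) {
            return Route.NATIVE_RUST;
        }
        // Otherwise (engine=java, or engine=rust with no native path) prefer the
        // dispatcher, and fall back to Spark only when the dispatcher is disabled.
        return codegenEnabled ? Route.CODEGEN_DISPATCHER : Route.SPARK_FALLBACK;
    }

    public static void main(String[] args) {
        // regexp_extract has no native path, so engine=rust falls through to the dispatcher.
        System.out.println(choose(Engine.RUST, true, false));
        // rlike has a native path, so engine=rust selects it.
        System.out.println(choose(Engine.RUST, true, true));
        // engine=java with the codegen flag disabled falls back to Spark.
        System.out.println(choose(Engine.JAVA, false, true));
    }
}
```

Keeping the decision in one function mirrors the stated goal of replacing per-serde engine checks with a single helper.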

How are these changes tested?

CometRegExpJvmSuite: 46 tests covering all six regex expressions with engine=java and the codegen flag enabled, plus a regression test that exercises the engine=rust → JVM dispatcher fallthrough for regexp_extract, regexp_extract_all, and regexp_instr.
9 SQL test files: rlike_{java,rust}.sql, regexp_replace_{java,rust}.sql, split_{java,rust}.sql, regexp_extract.sql, regexp_extract_all.sql, regexp_instr.sql.
CometStringExpressionSuite, CometSqlFileTestSuite, CometCodegenSuite, and CometTemporalExpressionSuite continue to pass; split tests migrated from the legacy per-class allowIncompatible flag to engine=rust.

Migration notes

spark.comet.exec.scalaUDF.codegen.enabled now defaults to true. With pure defaults the regex family and the DateFormatClass dispatcher path run on Comet rather than falling back to Spark. Set the flag to false to restore the old behavior.
The default regex engine changed from rust to java. With the dispatcher now on by default, the regex family runs with full Spark semantics out of the box.
Under engine=rust, regexp_extract, regexp_extract_all, and regexp_instr now fall through to the JVM codegen dispatcher instead of Spark (previously they fell back to Spark because they have no native rust path).
Users who previously set spark.comet.expression.regexp.allowIncompatible=true to enable the rust path should switch to spark.comet.exec.regexp.engine=rust. The per-expression flag is no longer consulted by the regex family.
Users who previously set spark.comet.expression.StringSplit.allowIncompatible=true should likewise switch to spark.comet.exec.regexp.engine=rust.
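For users who manage Spark properties programmatically, the migration above could be automated along these lines. This is a hedged sketch: the config keys are the ones named in these notes, but the class, the `migrate` method, and the choice to leave the legacy keys in place (and to respect an engine value that is already set) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the config key strings come from the PR text.
final class RegexConfMigrationSketch {
    static final String ENGINE_KEY = "spark.comet.exec.regexp.engine";
    static final String[] LEGACY_KEYS = {
        "spark.comet.expression.regexp.allowIncompatible",
        "spark.comet.expression.StringSplit.allowIncompatible"
    };

    // Returns a copy of conf with engine=rust added when a legacy opt-in was set
    // and no engine was chosen explicitly. Legacy keys are left untouched.
    static Map<String, String> migrate(Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>(conf);
        for (String legacy : LEGACY_KEYS) {
            if ("true".equalsIgnoreCase(conf.get(legacy))) {
                out.putIfAbsent(ENGINE_KEY, "rust");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> old = new LinkedHashMap<>();
        old.put("spark.comet.expression.regexp.allowIncompatible", "true");
        System.out.println(migrate(old).get(ENGINE_KEY));
    }
}
```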

Also fix CometArrayExpressionSuite compilation by qualifying the Spark udf() call, which was shadowed by the new org.apache.comet.udf package.

Implements a DataFusion PhysicalExpr that evaluates child expressions, exports the results as Arrow FFI arrays, calls CometUdfBridge.evaluate() via JNI, and imports the output array. Adds datafusion-comet-jni-bridge as a dependency of the spark-expr crate.

…UDF class via context classloader

Wrap the JNI body in try/finally so input ValueVectors and the result vector are always closed, even when the UDF or arrow export throws. Resolve the CometUDF class through the thread context classloader so user-supplied UDF jars (added via spark.jars) are visible from the bridge.
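The two techniques in this commit message, closing resources in a finally block and resolving classes through the thread context classloader, can be sketched in plain Java as follows. This is not the CometUdfBridge code: the class and method names are hypothetical, generic `AutoCloseable` values stand in for Arrow vectors, and the fallback to the defining classloader is an assumption.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; AutoCloseable stands in for Arrow ValueVectors.
final class BridgeCleanupSketch {
    // Resolve a class through the thread context classloader so that classes
    // from jars added at runtime are visible; fall back to this class's loader.
    static Class<?> resolve(String className) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = BridgeCleanupSketch.class.getClassLoader();
        }
        try {
            return Class.forName(className, true, cl);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class not found: " + className, e);
        }
    }

    // Run body and close every input afterwards, even when body throws.
    static <T> T evaluateAndClose(List<? extends AutoCloseable> inputs, Supplier<T> body) {
        try {
            return body.get();
        } finally {
            for (AutoCloseable input : inputs) {
                try {
                    input.close();
                } catch (Exception e) {
                    // Ignored in this sketch so the remaining inputs are still closed.
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("java.lang.String").getName());
        AutoCloseable input = () -> System.out.println("input closed");
        System.out.println(evaluateAndClose(List.of(input), () -> "result"));
    }
}
```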

…ns fall back to Spark

When routing RLike through the JVM UDF, reject Literal(null) and patterns that fail Pattern.compile during planning. Both cases now produce withInfo + None, letting Spark evaluate the expression instead of crashing the executor task with PatternSyntaxException or NullPointerException.
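The planning-time check described here amounts to attempting the compile up front and treating a null or invalid pattern as "not convertible". A minimal Java sketch of that idea follows; the class and method names are hypothetical, and an empty `Optional` plays the role that withInfo + None plays in the Scala serde.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of validating a literal pattern before execution.
final class PatternPlanCheckSketch {
    // Returns the compiled pattern, or empty when the caller should fall back.
    static Optional<Pattern> compileForPlanning(String literalPattern) {
        if (literalPattern == null) {
            // A null literal would otherwise surface as a NullPointerException at run time.
            return Optional.empty();
        }
        try {
            return Optional.of(Pattern.compile(literalPattern));
        } catch (PatternSyntaxException e) {
            // An invalid pattern is rejected during planning instead of failing a task.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(compileForPlanning("a+b").isPresent());
        System.out.println(compileForPlanning("(unclosed").isPresent());
        System.out.println(compileForPlanning(null).isPresent());
    }
}
```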

Make comet_udf_bridge an Option in JVMClasses so a missing org.apache.comet.udf.CometUdfBridge class (e.g. shading dropped org.apache.comet.udf.*) no longer crashes executor JVM init. The JVM-UDF dispatch path returns a clear ExecutionError when the bridge is unavailable. Also clarify the FFI lifetime contract on the result import.

Replace string literals "rust"/"java" used for the regexp engine selector with named constants on CometConf. Tighten CometRLike.getSupportLevel so it only reports Compatible(None) when the pattern is a Literal, matching the actual constraint enforced by the convert path.

Literal-folded children no longer get expanded to batch-row count before crossing JNI; ColumnarValue::Scalar is materialized at length 1, avoiding an O(rows) copy of values that never vary across the batch. Document the contract on CometUDF: scalar inputs arrive as length-1 vectors, vector inputs at the batch row count, and the result must match the longest input.
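The input contract stated here (scalars arrive at length 1, vectors at the batch row count, and the result matches the longest input) can be illustrated with plain arrays in place of Arrow vectors. Everything in this sketch is hypothetical, including the names and the use of `find()` for matching; it also assumes at least one row and that all non-scalar inputs share the same length.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; String[] stands in for Arrow vectors.
final class ScalarBroadcastSketch {
    // Reads a row, treating a length-1 column as a scalar shared by every row.
    static String at(String[] column, int row) {
        return column.length == 1 ? column[0] : column[row];
    }

    // Result length equals the longest input; a null subject or pattern yields null.
    static Boolean[] matchRows(String[] subject, String[] pattern) {
        int rows = Math.max(subject.length, pattern.length);
        // A scalar pattern is compiled once instead of once per row.
        Pattern shared = (pattern.length == 1 && pattern[0] != null)
            ? Pattern.compile(pattern[0]) : null;
        Boolean[] out = new Boolean[rows];
        for (int i = 0; i < rows; i++) {
            String s = at(subject, i);
            String p = at(pattern, i);
            if (s == null || p == null) {
                out[i] = null;
                continue;
            }
            Pattern compiled = shared != null ? shared : Pattern.compile(p);
            out[i] = compiled.matcher(s).find();
        }
        return out;
    }

    public static void main(String[] args) {
        Boolean[] result = matchRows(new String[] {"abc", "xyz", null}, new String[] {"^a"});
        for (Boolean b : result) {
            System.out.println(b);
        }
    }
}
```

Passing the scalar at length 1 is what avoids the O(rows) copy: the consumer broadcasts on read instead of the producer expanding on write.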

# Conflicts: # spark/src/main/scala/org/apache/comet/serde/strings.scala

andygrove · 2026-05-13T13:18:37Z

@mbutrovich following on from our discussion about configs yesterday, I filed an issue where we can have that discussion. #4310

…nature

PR apache#4306 added a numRows parameter to CometUDF.evaluate; merging main into this branch brought in the trait change but the six regexp UDF implementations still used the old single-argument signature, breaking comet-common compilation across all Spark profiles.

…ee-pr-4239

# Conflicts:
# spark/src/main/scala/org/apache/comet/udf/RegExpExtractAllUDF.scala
# spark/src/main/scala/org/apache/comet/udf/RegExpExtractUDF.scala
# spark/src/main/scala/org/apache/comet/udf/RegExpInStrUDF.scala
# spark/src/main/scala/org/apache/comet/udf/RegExpLikeUDF.scala
# spark/src/main/scala/org/apache/comet/udf/RegExpReplaceUDF.scala
# spark/src/main/scala/org/apache/comet/udf/StringSplitUDF.scala

Adds a master switch (default false) for the experimental JVM UDF framework so the Java regex engine cannot be activated without an explicit opt-in. With engine=java but jvmUdf.enabled=false, the six regex serdes return Unsupported with a message naming the master switch instead of silently using either path. Also extends Incompatible with optedInBy: Option[String] so a config (e.g. an engine selector) can serve as a per-expression incompatibility opt-in. Existing allowIncompatible flags continue to work; optedInBy is OR'd into the gating check in QueryPlanSerde. No existing serde uses optedInBy yet — this lays the foundation for the config simplification discussed in apache#4310.

andygrove · 2026-05-19T22:57:37Z

@mbutrovich I pushed some config changes, inspired by our earlier discussions - let me know what you think of the direction

Default engine is now `java` (routes through the JVM UDF when spark.comet.jvmUdf.enabled=true; falls back to Spark otherwise). Setting engine=rust runs the native Rust regex engine and is itself the opt-in for the semantic differences from Java regex, so there is no separate allowIncompatible flag for the regex family.
- Remove RegExp.isSupportedPattern (was a placeholder returning false)
- Replace per-serde engine checks with a single RegexpRoute helper
- Drop redundant *_rust_enabled.sql variants and migrate CometStringExpressionSuite split tests off the legacy per-class allowIncompatible flag

andygrove · 2026-05-20T13:52:34Z

> @mbutrovich I pushed some config changes, inspired by our earlier discussions - let me know what you think of the direction

I posted this comment prematurely - I still had local changes. They are pushed now.

Dual-impl regex serdes (rlike, regexp_replace, split) now return Incompatible(notes, optedInBy="spark.comet.exec.regexp.engine=rust") for the native rust path instead of Compatible. The standard QueryPlanSerde gating then sees engine=rust as the opt-in via the optedInBy mechanism introduced earlier, so the incompatibility is visible in EXPLAIN/logs rather than hidden behind a routing-helper short-circuit.
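The optedInBy gating described here, where a config setting such as `spark.comet.exec.regexp.engine=rust` counts as the opt-in and is OR'd with the existing allowIncompatible flag, might look roughly like the sketch below. This is an assumption-laden illustration in Java, not the QueryPlanSerde code: the class and method names are hypothetical, and parsing optedInBy as a `key=value` string is a guess at how the match could be done.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of OR-ing optedInBy into an incompatibility gate.
final class OptedInByGateSketch {
    static boolean isAllowed(
            Map<String, String> conf,
            String allowIncompatibleKey,
            Optional<String> optedInBy) {
        // Existing opt-in: the per-expression allowIncompatible flag.
        boolean legacyOptIn = "true".equalsIgnoreCase(conf.get(allowIncompatibleKey));
        // New opt-in: the session config matches the "key=value" named by optedInBy.
        boolean configOptIn = optedInBy.map(kv -> {
            int eq = kv.indexOf('=');
            if (eq < 0) {
                return false;
            }
            String key = kv.substring(0, eq);
            String expected = kv.substring(eq + 1);
            return expected.equalsIgnoreCase(conf.get(key));
        }).orElse(false);
        return legacyOptIn || configOptIn;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.comet.exec.regexp.engine", "rust");
        System.out.println(isAllowed(
            conf,
            "spark.comet.expression.RLike.allowIncompatible", // illustrative key name
            Optional.of("spark.comet.exec.regexp.engine=rust")));
    }
}
```

Because the serde still reports Incompatible, the reason stays visible in EXPLAIN output and logs while the engine selector alone decides whether the native path runs.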

Drop redundant interpolators in COMET_REGEXP_ENGINE doc string and remove the redundant CometConf self-import in CometStringExpressionSuite to satisfy scalafix. Switch existing rlike/regexp_replace tests to opt in via COMET_REGEXP_ENGINE=rust now that the engine selector is the gate for the Rust path, and reformat regex.md via prettier.

Resolve conflicts in pr_build_{linux,macos}.yml by integrating both the new codegen-suite additions from main and the CometRegExpJvmSuite from the PR, dropping the obsolete standalone "sql" matrix entry that main folded into the "spark" matrix. Resolve CometConf.scala by retaining both COMET_JVM_UDF_ENABLED / COMET_REGEXP_ENGINE from the PR and COMET_SCALA_UDF_CODEGEN_ENABLED from main. The follow-up refactor drops COMET_JVM_UDF_ENABLED in favor of COMET_SCALA_UDF_CODEGEN_ENABLED.

…of hand-written UDFs

Replace the six hand-written `RegExp*UDF` / `StringSplitUDF` JVM UDF implementations with the Arrow-direct codegen dispatcher introduced in PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher Janino-compiles Spark's own `doGenCode` for the expression, so the regex family inherits Spark-identical semantics with no per-expression glue code.

Changes:
- Delete `spark/src/main/scala/org/apache/comet/udf/RegExp*UDF.scala` and `StringSplitUDF.scala`. Their behavior is now provided by Spark's `doGenCode` running inside the dispatcher.
- Rewrite the regex serdes in `strings.scala`. Expressions with no native Rust path (`RegExpExtract`, `RegExpExtractAll`, `RegExpInStr`) share a new `CometRegexpCodegenOnly` base; expressions with a native path (`RLike`, `RegExpReplace`, `StringSplit`) keep an explicit route table where the JVM arm now delegates to `CometScalaUDF.emitJvmCodegenDispatch`.
- Drop the `spark.comet.jvmUdf.enabled` config. The codegen dispatcher already has its own master switch (`spark.comet.exec.scalaUDF.codegen.enabled`); gating the regex family on the same flag avoids two flags for the same path. `spark.comet.exec.regexp.engine` keeps the `java`/`rust` selector semantics, and `engine=java` now requires the codegen flag.
- Revert the native Rust additions in `jvm_udf/mod.rs` and `jni-bridge/src/lib.rs`. The codegen dispatcher constructs Arrow output fields JVM-side via `CometBatchKernelCodegenOutput.toFfiArrowField`, so the list-vector field-name normalization cast is unnecessary.
- Update `CometRegExpJvmSuite`, `CometRegExpBenchmark`, the regex SQL test fixtures, and the regex compatibility doc to reflect the new gating.

Test plan:
- `CometRegExpJvmSuite`: 45/45 pass (covers all six regex expressions through the codegen dispatcher).
- `CometSqlFileTestSuite`: 289/289 pass.
- `CometStringExpressionSuite`: 33/33 pass.
- `CometCodegenSuite`: 60/60 pass.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.

andygrove · 2026-05-26T22:06:01Z

@mbutrovich As discussed, I refactored this PR to use codegen dispatch.

The per-expression spark.comet.expression.regexp.allowIncompatible flag is no longer consulted by the regex family. Switch to engine=rust so the RLike serde reaches convertViaNativeRegex and emits the 'Only scalar regexp patterns are supported' fallback message the test asserts on.

Flip the default of spark.comet.exec.scalaUDF.codegen.enabled to true and drop "experimental" language across the regex/codegen docs and CometConf strings. With this default, the regex family (java engine path) and the DateFormat dispatcher route through Comet's Arrow-direct codegen kernel out of the box, so users see Comet acceleration for regex and complex date formatting without per-conf opt-in. The sentinel guard in CometSqlFileTestSuite still keys off the explicit "=true" opt-in: most expression fixtures use their own native paths and do not exercise the dispatcher, so we leave that scope unchanged.

…nted regex

Previously engine=rust returned Spark fallback for regexp_extract, regexp_extract_all, and regexp_instr because they have no native Rust path. With the codegen dispatcher now enabled by default, prefer the JVM dispatcher over Spark in that case so users still get Comet acceleration with full Spark semantics. Decline only when the native path and the dispatcher are both unavailable. Also document the per-expression spark.comet.expression.<ClassName>.enabled disable knobs in the regex compatibility guide, and add a regression test that exercises the new rust→JVM fallthrough.

Flip spark.comet.exec.scalaUDF.codegen.enabled back to false and restore the experimental, disabled-by-default language across the regex/codegen docs and CometConf strings. With this default, the regex family (java engine path) and the DateFormat dispatcher fall back to Spark unless the user explicitly opts in. This keeps the engine=rust JVM-dispatcher fallthrough behavior introduced separately on this branch; only the codegen-enabled-by-default change is reverted.

Remove the remaining experimental/disabled-by-default framing from regex.md so the Java engine reads as a normal, supported regex engine gated behind spark.comet.exec.scalaUDF.codegen.enabled.

andygrove added 30 commits April 30, 2026 06:17

docs: add implement-comet-expression Claude skill

9cd1566

docs: reference PR template and add skill-acknowledgement note

953cb86

docs: check datafusion-spark crate before writing native code

422d2b3

Merge branch 'add-implement-expression-skill'

88f2331

feat: add CometUDF trait for JVM-side scalar UDFs

eb8aa14

feat: add RegExpLikeUDF using java.util.regex.Pattern

60a2ecd

Also fix CometArrayExpressionSuite compilation by qualifying the Spark udf() call, which was shadowed by the new org.apache.comet.udf package.

feat: add CometUdfBridge JNI entry point for native UDF dispatch

633b75e

feat: add JvmScalarUdf proto message for JVM UDF dispatch

1c64070

feat: register CometUdfBridge in JVMClasses for native UDF dispatch

8f78436

feat: wire JvmScalarUdf proto into native planner

d8ab411

feat: add spark.comet.exec.regexp.useJVM config

4970c9c

feat: route RLike through JVM UDF when spark.comet.exec.regexp.useJVM…

54ddd50

… is true

test: add end-to-end suite for JVM-backed RLike

0a942ad

fix: use project-wide CometArrowAllocator in RegExpLikeUDF

fbfc158

docs: correct CometUdfBridge thread cache lifetime comment

909ab91

docs: document from_ffi consumption invariant in JvmScalarUdfExpr

862ed2e

style: apply make format

a943de5

docs: mark spark.comet.exec.regexp.useJVM experimental and generalize…

e1b9b2a

… wording

test: add CometRegExpBenchmark covering all rlike modes

76418c6

ci: register new RLike JVM-bridge test suites in PR workflows

8ac45be

build: exclude docs/superpowers from rat and git

a1f8ecf

remove skill

23a9e52

refactor: rename regexp.useJVM boolean to regexp.engine enum (rust|java)

1c66f44

test: cover empty and all-null subject vectors in RegExpLikeUDF unit …

85029c5

…suite

Merge remote-tracking branch 'apache/main' into java-regexp

b55adb0

# Conflicts: # spark/src/main/scala/org/apache/comet/serde/strings.scala

andygrove moved this to In progress in Comet Development May 13, 2026

andygrove added this to Comet Development May 13, 2026

andygrove added this to the 0.17.0 (June 2026) milestone May 13, 2026

andygrove mentioned this pull request May 14, 2026

feat(datetime): prototype JVM UDF path for Hour/Minute/Second (engine=java) #4321

Closed

mbutrovich mentioned this pull request May 15, 2026

feat: support stateful CometUDFs #4345

Merged

andygrove added 2 commits May 19, 2026 08:00

andygrove added 4 commits May 20, 2026 07:59

andygrove changed the title ~~feat: add experimental support for Spark regexp expressions via JVM UDF framework~~ feat: experimental Spark regex support via codegen dispatcher May 26, 2026

mbutrovich self-requested a review May 28, 2026 17:43

andygrove changed the title ~~feat: experimental Spark regex support via codegen dispatcher~~ feat: Add 100% Spark-compatible regex support via codegen dispatcher May 28, 2026

andygrove added 6 commits May 28, 2026 14:02

style: prettier-format regex compatibility doc

bb2c641

style: drop unused interpolator on CometConf regexp engine doc

ca61034

Merge branch 'main' into java-regexp

7f22f92

docs: drop experimental language from regex compatibility guide

7be7783

Remove the remaining experimental/disabled-by-default framing from regex.md so the Java engine reads as a normal, supported regex engine gated behind spark.comet.exec.scalaUDF.codegen.enabled.

andygrove mentioned this pull request Jun 2, 2026

Comet 0.17.0 Release #4564

Open

feat: Add 100% Spark-compatible regex support via codegen dispatcher #4239
andygrove wants to merge 65 commits into apache:main from andygrove:java-regexp
